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Abstract 

Inference in popular nonpar ametric Bayesian 
models typically relies on sampling or other 
approximations. This paper presents a 
general methodology for constructing novel 
tractable nonparametric Bayesian methods 
by applying the kernel trick to inference in 
a parametric Bayesian model. For exam- 
ple, Gaussian process regression can be de- 
rived this way from Bayesian linear regres- 
sion. Despite the success of the Gaussian 
process framework, the kernel trick is rarely 
explicitly considered in the Bayesian litera- 
ture;. In this papcir, wc; aim to fill this gap 
and demonstrate the potential of applying 
the kernel trick to tractable Bayesian para- 
metric models in a wider context than just 
regression. As an example, we present an in- 
tuitive Bayesian kernel machine for density 
estimation that is obtained by applying the 
kernel trick to a Gaussian generative model 
in feature space. 

1 Introduction 

The popularity of nonparametric Bayesian methods 
has steadily risen in machine learning over the past 
decade. Bayesian inference in almost all current non- 
parametric models relics on approximations which typ- 
ically involve Markov chain Monte Carlo, or more re- 
cently variational approximations [1]. There is an ever- 
growing interest in developing nonparametric Bayesian 
methods in which inference and prediction can be 
expressed in closed form and no approximations are 
needed. Such methods would be quite useful and de- 
sired in many applications, but are unfortunately rare 
at present. Perhaps the only known example is Gaus- 
sian process (CP) regression. In CP regression [2], the 
posterior predictive distribution is a tractable Gaus- 
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sian with parameters that can be computed from data 
exactly in polynomial time. The method is widely 
adopted and also has favourable frequentist asymp- 
totic properties [3]. But what is the secret behind 
the remarkable algorithmic clarity of GP regression? 
What makes closed-form computations possible? We 
argue that the key is that GP regression is also a ker- 
nel machine: the method is arrived at by applying the 
kernel trick in a parametric Bayesian model, namely 
linear regression (see e.g. Chapter 2 in [2]) 

Kernel methods [4, 5] use a simple trick, widely known 
as the kernel trick, to overcome the limitations of a 
linear model: the observations Xi Cz X are first em- 
bedded in a feature space J- using a nonlinear map- 
ping Lp : X 1-^ J^. A linear algorithm is then applied 
on the embedded representations </>„ = if{xn) instead 
of the observations themselves. If the algorithm only 
makes use of scalar products (</>„, 0m), then by re- 
placing all scalar products by tractable kernel evalua- 
tions k(xn, Xm) '■= {<pn, 0m), the cxprcssivc power can 
be substantially increased with only a minor increase 
in computational costs. Notably, the kernel trick al- 
lows one to construct nonparametric machine learning 
methods from parametric ones. 

Despite its popularity in "non-Bayesian" studies, the 
kernel trick is rarely considered as a construction tool 
in Bayesian nonparametrics. GP regression is a rare, 
if not the only example. CPs are therefore often 
called Bayesian kernel machines. In this paper, we 
consider finding new examples of Bayesian kernel ma- 
chines, i. e. broadening the intersection between ker- 
nel machines and nonparametric Bayesian methods. 
For Bayesian nonparametrics, the kernel approach of- 
fers invaluable closed-form computations and a rigor- 
ous analysis framework associated with reproducing 
kernel Hilbert spaces (RKHS). For kernel machines, 
a Bayesian formulation offcirs benefits, such as novel 
probabilistic approaches to setting kernel hyperparam- 
eters and means of incorporating such methods in hi- 
erarchical Bayesian models [6, 7, 8]. 



In this paper, wc present a methodology for finding 
novel Bayesian kernel machines, following the recipe 
of GP regression: 

1. Start with a simple Bayesian model of observa- 
tions. 

2. Derive exact Bayesian inference in the model. 

3. Express the posterior predictive distribution in 
terms of dot products and apply the kernel trick. 

The crucial point is finding the basic model in step 
1, in which both steps 2 and 3 are possible. Fortu- 
nately, this search is guided by intuitive orthonormal 
invariance considerations that will be demonstrated in 
this paper. We present an example Bayesian kernel 
machine for density estimation, based on the linear 
Gaussian generative model underlying principal com- 
ponent analysis (PC A) [9]. We show that the kernel 
trick can be applied in the Bayesian method by choos- 
ing prior distributions over the parameters which pre- 
serve invariance to orthonormal transformations. 

The rest of the paper is organised as follows. In Sec. 2, 
we review Bayesian inference in a Gaussian genera- 
tive model and discuss consequences of applying the 
kernel trick to the predictive density. We consider 
the infinite dimensional feature spaces case in Sub- 
sec. 2.2. In Sec. 4, we present experiments on high di- 
mensional density estimation problems comparing our 
method to other Bayesian and non-Bayesian nonpara- 
metric methods. 

2 A Gaussian model in feature space 

Assume that wc have observations Xi in a d- 
dimensional Euclidean space and that our task is to 
estimate their density. In anticipation of the sequel, we 
embed the observations Xi into a high-dimensional fea- 
ture space with an injective smooth nonlinear map- 
ping ip : X J^. Our density estimation method is 
based on a simple generative model on the ambient fea- 
ture space of the embedded observations 0i = ifi{xi). 
For now, think of ^ as a Z)-dimensional Euclidean 
space and <pi as arbitrary elements of T (i. e. they 
are not necessary constrained to lie on the observa- 
tion manifold O = {f{x) : x e X}). We suppose that 
(pi were sampled from a Gaussian distribution with 
unknown mean /u and covariance S: 

(Pl.,N\^l,■Sr^^^{n,■E),iA.d. (1) 

The notation 0i:jv is used to denote the set of vectors 
01 up to (pN- 

Now we will consider estimating parameters of this 
model in a Bayesian way. In Bayesian inference, one 



defines prior distributions over the parameters and 
then uses Bayes' rule to compute the posterior over 
them. Importantly, now we also require that the re- 
sulting Bayesian procedure is still amenable to the ker- 
nel trick. We therefore start by discussing a necessary 
condition for kernelisation which can then guide our 
choice of prior distributions. 

The kernel trick requires that the algorithm be ex- 
pressed solely in terms of scalar products {<pi,<pj). 
Scalar products are invariant under orthonormal trans- 
formations of the space, i. c. if A is an orthonormal 
transformation then {u, v) = {Au, Av) for all u, and 
V in the space. Thus, if one wants to express an al- 
gorithm in terms of scalar products, it has to be at 
least - invariant under orthonormal transformations, 
such as rotations, reflections and permutations. It is 
well known that PGA has this property, but, for exam- 
ple, factor analysis (FA) does not [9], thus one cannot 
expect a kernel version of FA without any restrictions. 

Another desired property of the method is analytical 

convenience and tractability, which can be ensured by 
using conjugate priors. The conjugate prior of the 
Gaussian likelihood is the Normal-inverse-Wishart, 
which in our case has to be restricted to meet the or- 
thonormal invariance condition: 

/x|S,/3~AA(^0,^S^, 

where W^^ and Af denote the inverse- Wishart and 
Gaussian distributions with the usual parametrisation. 
The general Normal-inverse-Wishart family had to be 
restricted in two ways: firstly, the mean of fi was set to 
zero; secondly, the scale matrix of the inverse- Wishart 
was set to be spherical. These sensible restrictions en- 
sure that the marginal distribution of (p is centered 
at the origin and is spherically symmetric and there- 
fore orthonormal invariance holds, which is required 
for kernelisation. 

Having defined the hierarchical generative model in 
eqns. (1) (2), our task is to estimate the density of 
0's given the previous observations (pi-.N, which in a 
Bayesian framework is done by calculating the follow- 
ing posterior predictive distribution: 

p(0|0i:Jv;cr^,a;,/3) = 

H, S)p(/x, ll|0i:jv; ctq, a, /3)rf/xrfS 

By straightforward computation, it can be shown 
that the posterior predictive distribution is a D- 
dimensional student-T distribution of the form (with 



the dependence on hyper-parameters made imphcit): 

P(0|01:iv) OC (3) 




* = 



where -y = ^+^+^ i = A - ^ 

[01 - 0, . . . , - 0] , = ^ Er^=i is the empir- 
ical mean in feature space and 1 is a A'^ x 1 vector of 
ones. 

In order to obtain an expression which only contains 
scalar products, we invoke Woodbury's matrix inver- 
sion formula [10]: 
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2.1 Kernel trick 



Until now, we assumed that was an arbitrary point 

in D-dimensional space. In reality, however, we only 
want to assign probabilities (4) to points on the so- 
called observation manifold, O = {ip{x) : x G X}, i.e. 
to points that can be realised by mapping an obser- 
vation X £ X to feature space = <f{x). Restricted 
to O, we can actually make use of the kernel trick, 
assuming as well that 0^ = ip{xi) for the previous 
observations. Indeed, Eq. (4) then only depends on 
Xi;N and X through 0^0, 0^^ and which can 

all be expressed in terms of pairwise scalar products 

4>n<Pm = {^{Xn),f{Xm)) = k{Xn,Xm) aS foUowS: 



<P<P-k{x,x) 2 + 
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As said previously, we are only interested in points 
in the (curved) observation manifold. The restric- 
tion of the predictive distribution (4) to the obser- 
vation manifold induces the same density function 



q(ip{x)\xi.j\[) = p{ip{x)\(pi-jf) (which is now unnor- 
malised) , but with respect to a d-dimensional Lebesgue 
base measure in the geodesic coordinate system of the 
manifold. Finally, we map the density q{ip{x)\xi.j^) 
back to the original input space by implicitly inverting 
the mapping <f, yielding the (unnormalised) predictive 
density qk{x\xi-N) defined on the input space. By do- 
ing so, we need to include a multiplicative Jacobian 
correction term which relates volume elements in the 
tangent space of the manifold to volume elements in 
the input space X: 

qk{x\xi:N) = 

(6) 

This formula can be be derived from the standard 
change-of-variable formula for densities, as we show 
in the Appendix A. 

So what exactly did we gain by applying the ker- 
nel trick? Despite the fact that the simple density 
model in the whole feature space is unimodal, in the 
local coordinate system of the non-linear observation 
manifold, the predictive density may appear multi- 
modal and hence possess interesting nonlinear fea- 
tures. This is illustrated in Fig. 1. Assume that 
we observe draws from a complicated distribution, 
which cannot be conveniently modelled with a linear 
Gaussian model (Fig. l.A). We map observations to 
a two dimensional feature space using the mapping 
<fi^{x) = X, ip'^{x) = (x)^ hoping that the embedded ob- 
servations are better fitted by a Gaussian, and carry 
out inference there (Fig. l.B). Note how all possible 
observations in the original space get mapped to the 
observation manifold O, which is now the parabola 
— (0i)2. The outcome of inference is a posterior 
predictive distribution in feature space, which takes 
a student-T form. To obtain the predictive distribu- 
tion in the observation space we have to look at the 
magnitude of the predictive distribution along the ob- 
servation manifold. This is illustrated by the contour 
lines in panel B. The restricted predictive distribu- 
tion is then "pulled back" to observation space and 
has to be multiplied by the Jacobian term (Fig. l.C) 
to yield the final estimation of the density of the ob- 
served samples (Fig. l.D). Remarkably, our method 
computes the unnormalised density estimate (before 
the Jacobian term is added. Fig. l.C) directly from 
the data (Fig. l.A). All intermediate steps, including 
the inversion of the mapping, are implicit in the prob- 
lem formulation, and we never have to work with the 
feature space directly. As the method essentially esti- 
mates the density by a student-T in feature space, we 
shall call it kernel student-T density estimation. 




Figure 1: Illustration of kernel student-T density estimation on a toy problem. A, Fourteen data points Xi:i4 
(crosses) drawn from from a mixture distribution {solid curve). B, Embedded observations <pi;i4{crosses) in the 
two dimensional feature space and contours of the predictive density (?(</>i5|xi:i4). The observation manifold is 
a parabola [solid curve), and we are interested in the magnitude of the predictive distribution to this manifold 
(colouring). C, The restricted posterior predictive distribution pulled back to the real line (solid curve) gets 
multiplied by the Jacobian term \/l + 4a;^ (dotted curve) to give D, the predictive distribution qk(xi5\xi:i4) 
(solid curve) in observation space. All distributions are scaled so that they normalise to one. 



2.2 Infinite dimensional feature spaces 

Of course, we constructed the previous toy example 
so that the simple normal model in this second or- 
der polynomial feature space is able to model the bi- 
modal data distribution. Should the data distribu- 
tion be more complicated, e. g. would have three or 
more modes, the second order features would hardly 
be enough to pick up all relevant statistical properties. 
Intuitively, adding more features results in higher po- 
tential expressive power, while the Bayesian integra- 
tion safeguards us from overfitting. In practice, there- 
fore, we will use the method in conjunction with rich, 
perhaps infinite dimensional feature spaces, which will 
allow us to relax these parametric restrictions and to 
apply our method to model a wide class of probability 
distributions. 

However, moving to infinite dimensional feature spaces 
introduces additional technical difficulties. As for one, 
Lebesgue measures do not exist in these spaces, there- 
fore our derivations based on densities with respect to 
Euclidean base measure should be reconsidered. For 
a fully rigorous description of the method in infinite 
spaces, one may consider using Gaussian base mea- 
sures instead. In this paper, we address two specific 
issues that arise when the feature space is large or in- 
finite dimensional. 

Firstly, because of the Jacobian correction term, the 
predictive density in input space is still not fully ker- 
nelised in (6). We note though that this term is the 
determinant of a small dx d matrix of inner products, 
which can actually be computed efficiently for some 
kernels. For example, the all-subsets kernel [11] has a 
feature space with exponential dimension D = 2"^, but 



each inner product of derivatives can be computed in 
0(d). But what about nonparametric kernels with in- 
finite D7 Interestingly, we found that shift invariant 
kernels - for which k(x, y) — k(x + z,y + z) for all z 
- actually yield a constant Jacobian term (see proof in 
the Appendix B) so it can be safely ignored to obtain 
a fully kernelised - although unnormalised - predic- 
tive density qk(x\xi;M) — q(<.p(x)\xi;tq) . This result is 
important, as the most commonly used nonparametric 
kernels, such as the squared exponential, or the Lapla- 
cian fall into this category. 

The second issue is normalisation: Eq. (6) only ex- 
presses the predictive distribution up to a multiplica- 
tive constant. Furthermore, the derivations implicitly 
assumed that a > D — 1, where D is the dimension- 
ality of the feature space, otherwise the probability 
densities involved become improper. This assumption 
clearly cannot be satisfied when D is infinite. However, 
one can argue that even if the densities in the feature 
space are improper, the predictive distributions that 
we use may still be normalisable when restricted to the 
observation manifold for a > d — 1. Indeed, the two- 
dimensional student-T is normalisable when restricted 
to the parabola as in Fig. 1 for all a > rather than 
a > 1 (to show this, consider = so that (3) takes 
the simple form q((p) = (cnst. + 0^0) ~). So al- 
though distributions in feature space may be improper, 
the predictive densities may remain well-defined even 
when nonparametric kernels are used. 

Given these observations, we thus decide to formally 
apply equation (6) for our predictive density estima- 
tion algorithm, ignoring the normalisation constant 
and also the Jacobian term when shift-invariant non- 
parametric kernels are applied. 



A: a = 0.01, /3 = 0.01 B: a = 3, /3 = 0.01 C: a = 10, /3 = 0.01 




Figure 2: Fantasy data drawn from a two dimensional kernel student-T model with SE kernel. The length-scale 
£ — 1 {bars below panels) and /3 = 0.01 are fixed, a varies between 0.01 and 10 (A to C). Panels show 50 points 
{blue dots) drawn from the generative model and the predictive density {gray level) of the 51^^ draw. 



Finally, we want to emphasize that the method does 
not imply nor does it require that arbitrary distribu- 
tions, when embedded in rich Hilbert spaces, will look 
exactly like Gaussian or student-T measures. Instead, 
the method uses the potentially infinitely flexible fam- 
ily of Gaussian measures to approximately model dis- 
tributions in the feature space, and uses Bayesian inte- 
gration to address model-misspecification via explicit 
representation of uncertainty. As demonstrated in 
Fig. 1, the richness and fiexibility of the density model 
stems from projecting the distribution onto a nonlinear 
observation manifold, allowing us to overcome limita- 
tions of Gaussian measures, such as unimodality. 

3 Related work 

Our method is closely related to kernel principal com- 
ponent analysis [12, kPCA] as both of them essentially 
use the same underlying generative model (Eq. 1). 
While our method is based on Bayesian inference, 
kPCA performs constrained maximum likelihood es- 
timation [13] of parameters fi and S. kPCA is a 
very effective and popular method for nonlinear fea- 
ture extraction. Its pitfall from a density estimation 
perspective is that it does not generally induce a sen- 
sible probability distribution in observation space. To 
understand this, consider performing kPCA of order 
one (i. e. recover just a single nonlinear component) 
on the toy data in Fig. l.A. The solution defines a de- 
generate Gaussian distribution in feature space which 
is concentrated along a line, roughly the long axes 
of the equiprobability ellipses in Fig. l.B. As such a 
distribution can intersect the observation manifold at 
most twice, the predictive distribution in observation 
space will be a mixture of two delta distributions. This 



degeneracy can be ruled out by recovering at least as 
many nonlinear components as the dimensionality of 
feature space, but this is clearly not a viable option 
in infinite or large dimensional feature spaces. The 
Bayesian approach sidesteps the degeneracy problem 
by integrating out the mean and covariance, thereby 
averaging many, possibly degenerate distributions to 
obtain a non-degenerate, smooth student-T distribu- 
tion in feature space, which results in a smooth density 
estimate in observation space. 

Another particularly interesting related topic is that 
of characteristic kernels [14]. A positive definite ker- 
nel k{x,x') — {ip{x),ip{x')) is called characteristic if 
the mean element /ijr = Ex^7r[v'(X)] uniquely char- 
acterises any probability measure tt. From a density 
estimation perspective, this suggests that estimating 
a distribution in observation space boils down to esti- 
mating the mean in a characteristic kernel space. The 
kernel moment matching framework [15] tries to ex- 
ploit characteristic kernels in a very direct way: pa- 
rameters of an approximating distribution are chosen 
by minimising maximum mean discrepancy in feature 
space. Our method can be interpreted as performing 
Bayesian estimation of first and second moments in 
feature space, therefore one may hope that work on 
characteristic kernels will help us further study and 
understand properties of the algorithm. 

The Gaussian process has been used as a nonparamet- 
ric prior over functions for unsupervised learning tasks 
in several studies, such as in Gaussian process latent 
variable models [16, GPLVMs] or the Gaussian pro- 
cess density sampler [8, GPDS]. What sets the present 
approach apart from these is that while they incorpo- 
rated Gaussian processes as a building block in an un- 



METHOD 


NOVELTY (AuC[%]) 


RECONS.(conf[%]) 


kst 


96.82 ± 0.8 


2.15 ± 1.0 


kde 


94.56 ±0.6 


9.24 ± 1.9 


dpm 


73.56 ±4.3 


23.85 ±8.2 



novelty detection 



Table 1: Performance of kst, kde and dpm in nov- 
elty detection {novelty, left) and label reconstruction 
{recons., right) on the USPS data. The table shows 
average AuC and confusion values ± one standard de- 
viation measured over 10 randomised experiments. 



supervised model, we re-used the construction of GP 
regression and explicitly applied the kernel trick in an 
parametric unsupervised method. The GPDS method 
relies on Markov chain Monte Carlo (MCMC) meth- 
ods for so called doubly-intractable distributions [17]. 
These methods can be used to sample from the pos- 
terior over hyper-parameters of probabilistic methods 
where normalisation is intractable. In particular, they 
could be used to integrate out the hyper-parameters 
q;,/3 and ctq, of our method. We emphasise that while 
MCMC is a crucial component of the GPDS, inference 
in our model is essentially closed-form. 

4 Experiments 

A common way to examine properties of unsuper- 
vised learning models is to draw fantasy datasets from 
the generative model. The kernel studcnt-T model 
can be simulated as follows: first, we fix a;i = 
for convenience. Our generative model with shift- 
invariant kernels define an improper prior over the 
whole space when there is no observation (as it should 
since only difference between points is meaningful for 
shift invariant kernels). Then we draw each subsequent 
Xn from the kernel student-T predictive distribution 
qk{xn\xi-n~i'-, <Jo, ct, P) Conditioned on previous draws. 
Sampling from the predictive distribution is possible 
via hybrid Monte Carlo [18]. 

Fig. 2 shows example fantasy datasets for different 
settings of parameters. We observe that the genera- 
tive process has an approximate clustering property, 
in which it behaves similarly to the Chinese restau- 
rant process. We also note that the "checkerboard - 
like" eigenfunctions of the squared exponential kernel 
induce a tree like refinement of the space, in which the 
method is similar to Dirichlet diffusion trees. 

The performance of density models where the normali- 
sation constants are not available - similar examples in 
the past included Markov random fields [19] and deep 
belief nets - is hard to evaluate directly. These den- 
sity models are typically used in unsupervised tasks, 
such as reconstruction of missing data, novelty detec- 
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Figure 3: Performance of kst and kde as a function 
of the number of training examples on USPS data. 
Errorbars show ± one standard deviation. 



tion and image denoising, where the important quan- 
tity is the shape of the distribution, rather than the 
absolute density values. In the following, we consider 
three such unsupervised tasks that are indicative of the 
model's performance: novelty detection, assessment 
of reconstruction performance and the recently intro- 
duced task of relative novelty detection [20]. The pur- 
pose of these experiments is to demonstrate, as proof 
of concept, that the method is able to model relevant 
statistical properties in high dimensional data. 

4.1 Novelty detection 

Novelty detection, recognition of patterns in test data 
that are unlikely under the distribution of the training 
data, is a common unsupervised task for which density 
estimation is used. Here, we considered the problem 
of detecting mislabelled images of handwritten digits 
given a training set of correctly labelled images from 
the USPS dataset. The dataset consists of 7291 train- 
ing and 2007 test images, each of which is a labelled, 
16 X 16 gray-scale image of a handwritten digit (0-9). 

We modelled the joint distribution of the image (a 
256 dimensional vector) and the label (a ten dimen- 
sional sparse vector, where the element is 1 and 
the rest are O's if the label is I) by a kernel student-T 
density {kst). We used a squared exponential kernel 
with length-scale chosen to be the median distance be- 
tween data-points along digit dimensions and length- 
scale 1 along the label dimensions. We trained the 
algorithm on a randomly selected subset of available 



training data and calculated predictive probabilities 
on a test dataset composed of 100 correctly labelled 
and 100 mislabelled test points. Digit-label pairs with 
predictive probability under a threshold were consid- 
ered mislabelled. We compared the performance of our 
method to two baseline methods, kernel density esti- 
mation (kde, also called Parzcn window estimate) and 
Dirichlet process mixtures of Gaussians {dpm), based 
on the area under the receiver operating characteris- 
tic curve (AuC) metric. Hypcr-paramcters of all three 
methods were chosen by grid search so that the av- 
erage AuC was maximised on a separate validation 
set. Table 1 summarises the results of this comparison 
based on ten randomised iteration of the experiment 
with 2000 training and 200 test samples. We found 
that kst outperformed both competing methods sig- 
nificantly (p < 0.01, two sample T-test). The perfor- 
mance of dpm was substantially worse than that of the 
other two, which demonstrates that general mixture 
models are very inefficient in modelling high dimen- 
sional complex densities. To investigate the difference 
between the performance of kde and kst, we carried 
out a second sequence of experiments where the size 
of the training set ranged from 10 to 400. Fig. 3 shows 
average AuC values as a function of training set size. 
We observed that the two algorithms perform simi- 
larly for small amounts of training data, the difference 
in performance becomes significant {p < 0.05) for 60 
training points or more. 

4.2 Label reconstruction 

In these experiments, we considered learning the clas- 
sification of handwritten digits. In order to assess the 
density modelling capabilities of our method, we posed 
this problem as an image reconstruction problem ^. 
Again, we augmented the gray-scale image with 10 
additional 'pixels' representing a sparse coding of the 
labels 0-9. We first trained our density model on the 
augmented images on randomly selected training sub- 
sets of various sizes. Then, on a separate subset of 
test images we computed the probability of each label 
0-9 conditioned on the image by computing the joint 
probability of the image with that label and renormal- 
ising. For each image we assigned the label for which 
the conditional predictive probability was the high- 
est (maximum a posteriori reconstruction). Results 
of comparison to kde and DPM with 2000 training 
and 400 test points are shown in Fig. 3. As measure 
of performance, we used average confusion, i.e. the 
fraction of misclassified test images. We again found 
that kst outperformed both other methods, and now 

^We could have considered more general image recon- 
struction tasks, but reconstructing labels equips us with 
intuitive measures of loss, such as confusion and AuC. 



the advantage of our method compared to kde is more 
substantial. Fig. 3 shows confusion of kst and kde as a 
function of the number of training examples. The dif- 
ference between the two methods becomes significant 
(p < 0.01) for 60 training examples or more. 

4.3 Relative novelty detection 

Finally, we considered the problem of relative novelty 

detection [20], for which we rc-uscd the experimen- 
tal setup in [20] . The data consists of two 200 x 200 
color satellite images, the background and the target 
(Fig. 4.A-B). The task is to detect novel objects on 
the target image relative to the background image. 
Treating RGB values in each non-overlapping 2x2 
patches as data points, we estimated the density over 
such patches in both the background {q{x)) and in 
the target {p{x)) images. A patch x was identified 
as relatively novel if the ratio p{x)/q{x) was above a 
certain threshold. The results produced by our algo- 
rithm are similar to that obtained by using state of the 
art novelty detection algorithms (c. f. Fig. 2 in [20]), 
even though here we only used a 3% random subsam- 
ple of available image patches to estimate target and 
background densities. 

5 Discussion 

Summary In this paper, we considered applying the 
kernel trick as a tool for constructing nonparametric 
Bayesian methods, and suggested a general recipe for 
obtaining novel analytically tractable Bayesian kernel 
machines. The starting point is a suitably simple para- 
metric Bayesian model in which inference is tractable. 
We have demonstrated how arguments based on or- 
thonormal invariance can be used to guide our choice 
of models that are amenable to kernel trick. Having 
defined the basic model, one needs to express posterior 
predictive distributions in terms of kernel evaluations, 
then apply the kernel trick in those equations. By 
using kernels with infinite dimensional feature spaces, 
novel tractable nonparametric Bayesian methods may 
be obtained in this way. 

To ilhistratc our general approach, we studied the 
problem of density estimation and presented kernel 
student-T density estimation, a novel Bayesian kernel 
machine for density modelling. We followed our gen- 
eral methodology to arrive at a closed form expression 
that is exact up to a multiplicative constant. We also 
carried out numerical experiments to investigate the 
performance of the method. The results demonstrate 
that the method is capable of estimating the distribu- 
tion of high dimensional real-world data, and can be 
used in various applications such as novelty detection 
and reconstruction. The most apparent limitation of 




Figure 4: A, Background and B, target images for relative novelty detection. C, Top 5% novel patches in the 
target image are highlighted in red. The results are qualitatively similar to those presented in [20, Fig. 2]. 



the kernel student-T density model is the intractabil- 
ity of its normalisation constant. Nevertheless, the 
unnormalised density can still be used in a variety of 
unsupervised learning tasks, such as image reconstruc- 
tion or novelty detection. As other kernel methods, our 
algorithm has cubic complexity in the number of train- 
ing points, though sparse approximations [21] can be 
used to construct faster algorithms for large-scale ap- 
plications. Another advantage of the kernel approach 
is that the expressions can be formally extended to de- 
fine non-trivial discrete distributions over e. g. graphs, 
permutations or text. 

Conclusions and future work We believe that the 
present paper opens the door for developing a novel 
class of Bayesian kernel machines. The methodology 
presented in this paper can be applied in further prob- 
lems to obtain nonparametric methods based on para- 
metric Bayesian models. Investigating merits and lim- 
itations of this general approach in various statistical 
estimation and decision problems is certainly an in- 
teresting theme for future research. One particularly 
exciting direction that we plan to pursue is a Bayesian 
two-sample hypothesis test analogous to the frequen- 
tist test based on characteristic kernels [14]. For hy- 
pothesis testing, one needs to apply the kernel trick to 
a ratio of marginal likelihoods, rather than to a pre- 
dictive density, which enables cancellations which sim- 
plify the issues with normalisation or Jacobians. Other 
potentially fruitful directions include semi-supervised 
regression and Bayesian numerical analysis [22]. 
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A Change-of-variable formula 

Theorem Let X and he d and D dimensional Euclidean spaces respectively {D > d), : X he a, 
smooth injective mapping and let q{4>) define a density function with respect to the Lebesgue measure on T. 
Then the restriction of distribution q to the observation manifold <f{X) (i. e. conditioning on the event of being 
in the observation manifold) induces the following density in the input space X: 

We suspect this result may have appeared in several forms in the literature, but we didn't succeed to find a 
citation, and so we provide a proof here. 

Proof Note that we cannot apply the standard change-of-variable formula directly as the two spaces have 
different dimensions. So we augment the space X with an Euclidean space S of dimension D — d. We then define 
the following mapping F : X x E 

F{x,^) = ipix) + V^i, 

where Vx is a, D x {D — d) matrix of mutually orthonormal vectors to the tangent space of the manifold at x 
(i. e. a basis for the orthogonal space to the tangent space at x). For small enough ^'s (which could depend on 
a;), F will map small neighborhoods of (a;,0) to neighborhoods of ip{x) in an injective manner. We thus apply 
the standard change of variable formula to map the density q{4>) to a density on an open set around X x {0}: 

p{x,^) = q{F{x,0)\detJF{x,^)\, 

where Jf{x, ^) is the Jacobian matrix of the F mapping. To condition on being in the observation manifold, we 
simply need to condition on ^ = 0, and so the density we want to compute is: 

p{x) oc p{x, 0) = q{F{x, 0)) |det Jpix, 0)| . 

Now, Jf{x, 0) = {Jip{x),Vx) where J^{x) = ( ^g^i^ , ■ ■ ■ , ^g^d ^ ) is the Jacobian matrix of if. Finally, we use the 
fact that I det(JF)| = (det(JjJir))^/^ for any squared matrix Jf to get that: 

|det(Jf(a;,0))|=det(^ M^V^M^) J ^ = det {J^{xf J^{x)y\ 

where we used that Jy(a;)^14 = and VjVx = I hy orthonormality of the Vx with the tangent space spanned 
by the columns of J^{x). Finally, we use the fact that: 



to obtain (7). □ 



We provide the sketch of a proof that ( , ) doesn't depend on x for a translation invariant kernel under 



B Jacobian term for translation invariant kernel 

suitable regularity conditions. 

By Bochner's theorem (e.g. theorem 4.1 in [2]), a translation invariant kernel can be expressed as the inverse 

Fourier transform of a positive finite measure. If moreover we assume that the Fourier transform of the kernel 
exists, this measure will have a density p{s) with respect to the Lebesgue measure, and so we can write: 

k{x-y)= [ p(s)e2^""^('=-f)rfs 

= J p{s) (cos(27rs^(a; — y)) + isin{2ns^ {x — y))) ds 

= J p{s) (cos(27rs^(x — y))) ds 

as sin(s^5) is an odd function of s for any 5 and p{s) is a symmetric function given that it is the inverse Fourier 
transform of the symmetric function k, so the Cauchy principal value of the second term vanishes. Now using a 
trigonometric identity, we get: 

kix - y) = jpi^s) ((cos(2..-.) cos(2..-,)) + (sin(2.3-.) sin(2.a-,))) ds 

f I rr\ ( cos(27rs"^a;) \ ^-—^ ( sin(27rs"ry) \ \ 
= I \^ isin(2.sTx) ) ' ^ Uin(2..Ty) J / ^s 

So we can obtain an explicit feature mapping by defining: 

^(")=i^isin(2..T,) 
Assuming that p{s) goes to zero sufficiently quickly, we have that: 



s6l 



df{x) 



/— sin(27rs^a;) 
2^^Sk^/p(^){ ^ / 



dxfc \ \ cos(27rs^a;) 

and using a similar derivation as we used above, we get that: 

= 4-77^ J SkSip{s)e°ds 
which doesn't depend on x as we wanted to show. □ 



